
Fix FRoGS compatibility on modern Python / TF 2.x / macOS#2

Open
chronicgiardia wants to merge 1 commit into chenhcs:main from chronicgiardia:fix/modernize-compat-macos

Conversation

@chronicgiardia

Summary

Modernize the FRoGS codebase so it runs cleanly on Python 3.9–3.12, TensorFlow 2.x, and macOS. The original symptoms were: stale six and __future__ imports, multiprocessing hangs on macOS, scripts expecting an uncompressed L1000_PhaseI_and_II.csv even though the shipped dataset is gzipped, private tensorflow.python.keras imports, and a Dense(hid_dim/4) float-division bug.

Changes

Multiprocessing (src/utils/parallel.py)

  • Rewrote parallel.map on top of concurrent.futures.ProcessPoolExecutor.
  • Default to the fork start method on macOS so worker processes inherit module-level globals (go_emb, archs4_emb, …) that FRoGS worker functions rely on — this was the root cause of hangs on macOS.
  • Dropped legacy six / __future__ imports.
  • Added FROGS_N_CPU and FROGS_MP_START env overrides.
  • Aggregate worker exceptions into a single RuntimeError instead of silently deadlocking.
  • Optional tqdm progress when available.
  • Legacy MP / parmap / parprep retained for backwards compatibility.

Auto .gz handling (src/utils/io_utils.py, new)

  • read_csv_auto() transparently resolves .csv ⇄ .csv.gz.
  • validate_required_files() reports every missing input at once.
  • Wired into l1000_model.py and l1000_inference.py for sig_file, target_file, and term2gene_id.csv.
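A minimal sketch of the `.csv` ⇄ `.csv.gz` resolution; the actual `io_utils.py` may differ in details (the `_resolve` helper name is illustrative):

```python
import os
import pandas as pd  # pinned in the PR's requirements.txt

def _resolve(path):
    """Return `path` if present, else its .csv <-> .csv.gz sibling, else None."""
    if os.path.exists(path):
        return path
    sibling = path[:-3] if path.endswith(".gz") else path + ".gz"
    return sibling if os.path.exists(sibling) else None

def read_csv_auto(path, **kwargs):
    """Read a CSV whether it is stored as .csv or .csv.gz."""
    resolved = _resolve(path)
    if resolved is None:
        raise FileNotFoundError(f"{path} (and its .gz/.csv sibling) not found")
    # pandas infers gzip compression from the .gz suffix, so a single
    # read_csv call covers both the plain and the compressed case.
    return pd.read_csv(resolved, **kwargs)
```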

TensorFlow 2.x compatibility

  • Removed private from tensorflow.python.keras import backend as K imports from l1000_model.py, l1000_inference.py, gene_vec_model.py, snp_gene_vec_model.py (K was unused).
  • Fixed layers.Dense(hid_dim/4) → layers.Dense(hid_dim // 4).
  • Warn at import time if tf.__version__ is < 2.x.

Error handling and usability

  • Up-front file validation with a single consolidated error message (including .gz alternates) in both L1000 scripts.
  • ensure_dir() auto-creates outdir / modeldir.
  • parallel.map(..., progress=True) wired into the embedding step so tqdm bars render.
  • Fixed four literal '\\n' output writes in snp_gene_vec_model.py that would have emitted a backslash-n instead of a newline.
  • New requirements.txt pinning compatible ranges for numpy, pandas, scikit-learn, scipy, tensorflow, tqdm, matplotlib, requests, goatools.
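The validation and directory helpers can be sketched as follows (simplified; the actual `io_utils.py` signatures may differ):

```python
import os

def ensure_dir(path):
    """Create `path` (and parents) if missing; no-op when it already exists."""
    os.makedirs(path, exist_ok=True)

def validate_required_files(paths):
    """Check every required input at once; a .gz sibling counts as present.

    Returns (ok, missing) so callers can print one consolidated error
    instead of failing on the first absent file.
    """
    missing = [
        p for p in paths
        if not (os.path.exists(p) or os.path.exists(p + ".gz"))
    ]
    return (not missing, missing)
```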

Validation

  • python -m py_compile passes on every edited file.
  • New demo/smoke_test.py exercises parallel.map (sequential + parallel) and read_csv_auto; passes end-to-end against the shipped data/L1000_PhaseI_and_II.csv.gz.
  • After installing dependencies (celldega + tensorflow>=2.18 + goatools in a modern env), TF 2.21 / Keras 3.14 / goatools 1.6 all import cleanly alongside every modified FRoGS module.

Full model training was intentionally not re-run (epochs=60 / 3000 requires GPUs and many hours); hyperparameters and data flow are unchanged, so results should match the original paper when run end-to-end.


Conversation: https://app.warp.dev/conversation/a6fff60d-aa3f-48fa-bb84-83276f2a8b6c
Plan: https://app.warp.dev/drive/notebook/BEMUZDM4kU4xnq0cBEopkj

Co-Authored-By: Oz <oz-agent@warp.dev>

Copilot AI review requested due to automatic review settings (April 23, 2026 03:24)

Copilot AI left a comment


Pull request overview

Modernizes FRoGS to run on current Python (3.9–3.12), TensorFlow 2.x, and macOS by updating multiprocessing behavior, adding transparent .gz data handling, and removing outdated TensorFlow/private imports.

Changes:

  • Replaced legacy multiprocessing map logic with a ProcessPoolExecutor-based parallel.map, with env overrides and optional progress display.
  • Added io_utils.py to transparently resolve .csv ⇄ .csv.gz, validate required inputs up-front, and ensure output directories exist; wired into L1000 scripts.
  • Updated TF/Keras usage (removed private imports, fixed Dense(hid_dim/4) float division), added a smoke test and dependency ranges.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

Summary per file:
  • src/utils/parallel.py: New parallel.map implementation with ProcessPoolExecutor, env overrides, progress, and aggregated errors.
  • src/utils/io_utils.py: New helpers for .gz-aware path resolution, CSV loading, input validation, and directory creation.
  • src/l1000_model.py: Adds TF version warning, up-front file validation, .gz-aware CSV reading, and progress-enabled parallel embedding.
  • src/l1000_inference.py: Mirrors the L1000 model changes: TF warning, validation, .gz-aware CSV reading, progress-enabled parallel embedding.
  • src/gene_vec_model.py: Removes private TensorFlow backend import.
  • src/snp_gene_vec_model.py: Adds SNP-oriented training script (but currently references missing modules).
  • requirements.txt: Introduces dependency version ranges for modern environments.
  • demo/smoke_test.py: Adds a lightweight script to exercise new parallel + .gz CSV behavior.


Comment thread src/utils/parallel.py
Comment on lines +46 to +51
if sys.platform == "darwin":
    try:
        return multiprocessing.get_context("fork")
    except ValueError:
        pass
return multiprocessing.get_context()

Copilot AI Apr 23, 2026


On macOS this forces the multiprocessing start method to fork. In this repo, both src/l1000_model.py and src/l1000_inference.py import TensorFlow before calling parallel.map, and forking after importing TensorFlow (or other multi-threaded native runtimes) is not considered safe and can crash/hang. If the goal is to preserve module-level globals, consider providing an initializer/setup step for workers under spawn/forkserver, or at least documenting this caveat and defaulting to the platform default unless the user opts into fork via FROGS_MP_START.
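One way to realize the initializer suggestion under spawn could look like the sketch below; all names here are illustrative, not FRoGS code, and the state rebuilt per worker is just a dict standing in for the embedding globals:

```python
import concurrent.futures
import multiprocessing

# Per-worker state, rebuilt by the initializer instead of inherited via fork.
_worker_state = {}

def _init_worker(state):
    """Runs once in each worker under any start method (spawn included)."""
    # In FRoGS this could load go_emb / archs4_emb from files named in
    # `state`; here we just copy a dict to keep the sketch self-contained.
    _worker_state.update(state)

def _lookup(key):
    return _worker_state[key]

def run(keys, state):
    # 'spawn' avoids forking after TensorFlow has already been imported.
    ctx = multiprocessing.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=2,
        mp_context=ctx,
        initializer=_init_worker,
        initargs=(state,),
    ) as ex:
        return list(ex.map(_lookup, keys))
```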

Comment thread src/l1000_model.py
Comment on lines +300 to +312
# Validate all required inputs up-front so the user learns about every
# missing file at once rather than one failure at a time.
ok, missing = validate_required_files([cpdlist_file, target_file, sig_file, emb_go, emb_archs4])
if not ok:
    sys.stderr.write(
        "ERROR: the following required input files could not be found\n"
        "(a .gz counterpart was also tried for each):\n - "
        + "\n - ".join(missing)
        + "\n"
    )
    sys.exit(2)
ensure_dir(outdir)
ensure_dir(modeldir)

Copilot AI Apr 23, 2026


validate_required_files() treats a .gz sibling as acceptable, but the script continues using the original paths (e.g., open(emb_archs4, ...) and open(cpdlist_file) below). This means validation can succeed when only *.gz exists, and then the script still fails at the first open(...). Consider resolving each validated path (e.g., via resolve_data_path) and reassigning cpdlist_file/emb_go/emb_archs4/etc. to the resolved filename (and using gzip.open when appropriate), or limiting validation to the exact filenames you later open directly.
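The fix the reviewer describes amounts to resolving each path once, right after validation, and reusing the resolved name for every later open. A sketch, where `resolve_data_path` and `open_auto` are hypothetical helper names:

```python
import gzip
import os

def resolve_data_path(path):
    """Return an existing path, preferring the exact name over a .gz sibling."""
    if os.path.exists(path):
        return path
    if os.path.exists(path + ".gz"):
        return path + ".gz"
    raise FileNotFoundError(path)

def open_auto(path, mode="rt"):
    """Open plain or gzipped text transparently, based on the resolved name."""
    resolved = resolve_data_path(path)
    opener = gzip.open if resolved.endswith(".gz") else open
    return opener(resolved, mode)

# After validation, reassign so later open() calls hit the real file, e.g.:
#   cpdlist_file = resolve_data_path(cpdlist_file)
#   with open_auto(cpdlist_file) as f:
#       ...
```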

Comment thread src/l1000_inference.py
Comment on lines +213 to +223
ok, missing = validate_required_files([cpdlist_file, sig_file, emb_go, emb_archs4])
if not ok:
    sys.stderr.write(
        "ERROR: the following required input files could not be found\n"
        "(a .gz counterpart was also tried for each):\n - "
        + "\n - ".join(missing)
        + "\n"
    )
    sys.exit(2)
ensure_dir(outdir)

Copilot AI Apr 23, 2026


Same issue as in l1000_model.py: validate_required_files() will accept *.gz alternates, but the code later calls open(emb_archs4, ...), open(emb_go, ...), and open(cpdlist_file) using the original (possibly missing) paths. To avoid “validated but still FileNotFoundError”, resolve and use the actual existing path(s) after validation.

Comment thread src/snp_gene_vec_model.py
Comment on lines +27 to +32
from utils.snp_utils import SNPProcessor, create_snp_associations
from utils.snp_validation import SNPValidator
import logging

logger = logging.getLogger(__name__)


Copilot AI Apr 23, 2026


This module imports utils.snp_utils and utils.snp_validation, but those modules are not present under src/utils/ in this PR (current src/utils contains only io_utils.py, parallel.py, random_walk.py, sampling_util.py). As-is, importing/running this script will raise ModuleNotFoundError. Either add the missing utility modules in this PR, or remove/guard these imports and any SNP-specific code paths.

Suggested change

Before:
    from utils.snp_utils import SNPProcessor, create_snp_associations
    from utils.snp_validation import SNPValidator
    import logging

    logger = logging.getLogger(__name__)

After:
    import logging

    logger = logging.getLogger(__name__)

    try:
        from utils.snp_utils import SNPProcessor, create_snp_associations
        from utils.snp_validation import SNPValidator
        _SNP_IMPORT_ERROR = None
    except ModuleNotFoundError as exc:
        _SNP_IMPORT_ERROR = exc
        logger.warning(
            "Optional SNP utility modules are unavailable; SNP-specific functionality "
            "will raise an error when used: %s",
            exc,
        )

        class SNPProcessor:  # type: ignore[no-redef]
            def __init__(self, *args, **kwargs):
                raise ModuleNotFoundError(
                    "SNP support requires utils.snp_utils, which is not available in this "
                    "build. Add the missing SNP utility modules or avoid SNP-specific code paths."
                ) from _SNP_IMPORT_ERROR

        class SNPValidator:  # type: ignore[no-redef]
            def __init__(self, *args, **kwargs):
                raise ModuleNotFoundError(
                    "SNP support requires utils.snp_validation, which is not available in "
                    "this build. Add the missing SNP utility modules or avoid SNP-specific "
                    "code paths."
                ) from _SNP_IMPORT_ERROR

        def create_snp_associations(*args, **kwargs):  # type: ignore[no-redef]
            raise ModuleNotFoundError(
                "SNP support requires utils.snp_utils, which is not available in this "
                "build. Add the missing SNP utility modules or avoid SNP-specific code paths."
            ) from _SNP_IMPORT_ERROR

Comment thread src/snp_gene_vec_model.py
Comment on lines +236 to +251
# Demonstrate usage with signature embedding
print("\nTesting SNP signature embedding...")
from snp_signature_embedding import SNPSignatureEmbedder

# Load the newly trained embeddings
embedder = SNPSignatureEmbedder(go_emb_file=embedding_file,
                                archs4_emb_file='../data/gene_vec_archs4_256.csv')

# Test with example SNPs
test_snps = ['rs2230317', 'rs10273927', 'rs7398691']
test_effects = {'rs2230317': 0.05, 'rs10273927': -0.03, 'rs7398691': 0.08}

embedding = embedder.compute_snp_signature_embedding(test_snps, test_effects)
print(f"Test signature embedding shape: {embedding.shape}")
print(f"Sample embedding values: {embedding[:5]}")


Copilot AI Apr 23, 2026


from snp_signature_embedding import SNPSignatureEmbedder refers to a module that is not present in this repository/PR (there is src/signature_embedding.py, but no src/snp_signature_embedding.py). This will fail at runtime when --datatype snp is used. Consider adding the missing module, importing the correct existing module, or removing this demo block from the script entry point.

Suggested change

Before:
    # Demonstrate usage with signature embedding
    print("\nTesting SNP signature embedding...")
    from snp_signature_embedding import SNPSignatureEmbedder

    # Load the newly trained embeddings
    embedder = SNPSignatureEmbedder(go_emb_file=embedding_file,
                                    archs4_emb_file='../data/gene_vec_archs4_256.csv')

    # Test with example SNPs
    test_snps = ['rs2230317', 'rs10273927', 'rs7398691']
    test_effects = {'rs2230317': 0.05, 'rs10273927': -0.03, 'rs7398691': 0.08}

    embedding = embedder.compute_snp_signature_embedding(test_snps, test_effects)
    print(f"Test signature embedding shape: {embedding.shape}")
    print(f"Sample embedding values: {embedding[:5]}")

After:
    # Demonstrate usage with signature embedding when the optional module is available
    print("\nTesting SNP signature embedding...")
    try:
        from snp_signature_embedding import SNPSignatureEmbedder

        # Load the newly trained embeddings
        embedder = SNPSignatureEmbedder(
            go_emb_file=embedding_file,
            archs4_emb_file='../data/gene_vec_archs4_256.csv'
        )

        # Test with example SNPs
        test_snps = ['rs2230317', 'rs10273927', 'rs7398691']
        test_effects = {'rs2230317': 0.05, 'rs10273927': -0.03, 'rs7398691': 0.08}

        embedding = embedder.compute_snp_signature_embedding(test_snps, test_effects)
        print(f"Test signature embedding shape: {embedding.shape}")
        print(f"Sample embedding values: {embedding[:5]}")
    except ModuleNotFoundError:
        logger.warning(
            "Skipping SNP signature embedding demo because "
            "'snp_signature_embedding' is not available."
        )

Comment thread src/utils/parallel.py
Comment on lines +451 to +452
If set, each individual task must complete within ``timeout``
seconds or a :class:`concurrent.futures.TimeoutError` is raised.

Copilot AI Apr 23, 2026


timeout is documented as a per-task limit, but the implementation passes it to concurrent.futures.as_completed(..., timeout=timeout), which applies to the entire iterator (overall wall time) and does not enforce a per-future timeout (since fut.result() is called without a timeout). This can lead to confusing behavior (and if a task hangs, a TimeoutError may still leave the executor waiting on shutdown). Consider either (a) enforcing per-task timeouts via fut.result(timeout=timeout) (and cancelling/shutting down remaining futures on timeout), or (b) renaming/re-documenting this parameter as an overall timeout.
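Option (a), bounding each wait with `fut.result(timeout=...)` instead of passing the timeout to `as_completed()`, could be sketched as (illustrative code, not the PR's implementation):

```python
import concurrent.futures

def map_with_task_timeout(func, items, timeout, max_workers=2):
    """Bound each result wait individually; as_completed(timeout=...) would
    only limit the overall wall time of the whole iteration."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as ex:
        futs = [ex.submit(func, item) for item in items]
        results = []
        try:
            for fut in futs:
                # Each wait is bounded by `timeout` (tasks run concurrently,
                # so this approximates, not exactly equals, a per-task limit).
                results.append(fut.result(timeout=timeout))
        except concurrent.futures.TimeoutError:
            for f in futs:
                # cancel() only drops tasks still queued; an already-running
                # hung task will still delay executor shutdown.
                f.cancel()
            raise
    return results
```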

Suggested change

Before:
    If set, each individual task must complete within ``timeout``
    seconds or a :class:`concurrent.futures.TimeoutError` is raised.

After:
    If set, the parallel operation as a whole must complete within
    ``timeout`` seconds or a :class:`concurrent.futures.TimeoutError`
    is raised.

